feat: dynamo inference backend integration #2737

Open
biswapanda wants to merge 19 commits into PrimeIntellect-ai:main from biswapanda:rl-sdk-4

biswapanda commented Jun 9, 2026

Description

Replaced by #2773.

End-to-end support for running prime-rl RL training against NVIDIA Dynamo (GB200/GB300) alongside the existing vLLM path. Adds a Dynamo inference backend, NCCL/filesystem weight transfer for GB200, vLLM 0.22 patches, MoE routed-experts capture + replay, and the deploy tooling (image, helm, k8s manifests) to run it.

Highlights

  • Dynamo backend: AdminAPI abstraction + backend selector (client.backend = vllm | dynamo) + RL worker discovery (GET /v1/rl/workers) for Dynamo-served inference.
  • Weight transfer: NCCL broadcast + FP8/E8M0 conversion for GB200 (qwen3_moe / glm_moe), plus an NFS-safe filesystem broadcast path with weight_broadcast.keep_recent.
  • routed_experts (MoE expert replay): the orchestrator decodes the {data, shape, start, dtype} payload according to its dtype (uint8 or uint16, normalizing uint16→int32 for the trainer, with an int32 fallback for >65535 experts), and the trainer replays the captured routing so that recomputed logprobs match inference. Inference forwards moe_backend and auto-selects triton when router replay is enabled: the default FlashInfer fused MoE kernel bypasses the capture hook (yielding all-zero routing), so a non-fused backend is required.
  • Orchestrator: dispatch compute_teacher_logprobs by renderer_transport (vLLM generate vs Dynamo nvext TITO); stop sending return_token_ids for Dynamo compatibility.
  • Inference: vLLM 0.22 patches — fp32 lm-head, int64 silu_mul_quant, padded scrub.
  • Deploy: Dockerfile.cuda.runtime (vLLM 0.22, DeepGEMM) + Dockerfile.dynamo, helm chart updates, Dynamo k8s manifests (client example sets backend=dynamo), and tools/dynamo run/smoke scripts.
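
The dtype-aware routed_experts decoding described above could look roughly like the following. This is a minimal sketch, not the code in this PR: it assumes `data` is base64-encoded little-endian bytes, and the function name `decode_routed_experts` is hypothetical. Only the {data, shape, start, dtype} keys and the uint8/uint16/int32 dtypes come from the description.

```python
import base64
import math
import struct

# struct codes with fixed little-endian sizes (assumed wire layout).
_FORMATS = {"uint8": "B", "uint16": "H", "int32": "i"}


def decode_routed_experts(payload: dict) -> tuple[list[int], tuple[int, ...], int]:
    """Decode a {data, shape, start, dtype} payload into flat expert ids.

    Returns (values, shape, start). Python ints are unbounded, so widening
    uint8/uint16 values for the trainer happens implicitly here.
    """
    dtype = payload["dtype"]
    if dtype not in _FORMATS:
        raise ValueError(f"unsupported routed_experts dtype: {dtype!r}")

    shape = tuple(payload["shape"])
    count = math.prod(shape)
    raw = base64.b64decode(payload["data"])  # assumption: base64 transport

    fmt = f"<{count}{_FORMATS[dtype]}"
    if struct.calcsize(fmt) != len(raw):
        raise ValueError(
            f"payload has {len(raw)} bytes, expected {struct.calcsize(fmt)} "
            f"for shape {shape} and dtype {dtype}"
        )

    values = list(struct.unpack(fmt, raw))
    return values, shape, payload["start"]
```

Checking the byte length against `shape` and `dtype` before unpacking turns a mismatched dtype into an explicit error instead of silently misread expert ids.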

Type of Change

  • New feature (non-breaking change which adds functionality)

Review

Codex adversarial review: SIGN-OFF (head 1b5917a). The 2 remaining review threads are production-path follow-ups unrelated to routed_experts, each flagged with a fix: retries for the weight-update pause, and a broadcast keep_recent that is ≥ orchestrator.max_off_policy_steps.
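
The keep_recent follow-up amounts to a config invariant. A minimal sketch of such a check, assuming both values are plain integers; the function name `check_keep_recent` is hypothetical, and the rationale in the comment is an interpretation rather than something stated in the review thread.

```python
def check_keep_recent(keep_recent: int, max_off_policy_steps: int) -> None:
    """Reject configs where broadcast retention is shorter than the off-policy window.

    Assumed rationale: inference may still need weights that are up to
    max_off_policy_steps behind the trainer, so at least that many recent
    broadcast steps have to survive cleanup.
    """
    if keep_recent < max_off_policy_steps:
        raise ValueError(
            f"weight_broadcast.keep_recent ({keep_recent}) must be >= "
            f"orchestrator.max_off_policy_steps ({max_off_policy_steps})"
        )
```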

Validation

3-GPU GB200 (1 inference + 2 FSDP trainer), Qwen3-30B-A3B-Thinking, router replay + moe_backend=triton: 10-step RL run with Mismatch KL 0.0002–0.0005 every step (faithful routing replay, no drift), no errors/OOM, stable memory.
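
For readers unfamiliar with the metric, a mismatch KL of this kind compares the per-token logprobs recorded at inference with the ones the trainer recomputes for the same tokens. The sketch below is one common way to estimate it (the non-negative "k3" estimator) and is an assumption; the exact formula prime-rl uses for Mismatch KL is not given in this description, and the function name is hypothetical.

```python
import math


def mismatch_kl(inference_logprobs: list[float], trainer_logprobs: list[float]) -> float:
    """Estimate KL(inference || trainer) from tokens sampled at inference.

    Uses the k3 estimator: with r = p_trainer / p_inference per token,
    each token contributes (r - 1) - log(r), which is always >= 0.
    """
    if len(inference_logprobs) != len(trainer_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if not inference_logprobs:
        raise ValueError("need at least one token")

    total = 0.0
    for lp_inf, lp_train in zip(inference_logprobs, trainer_logprobs):
        log_r = lp_train - lp_inf
        total += (math.exp(log_r) - 1.0) - log_r
    return total / len(inference_logprobs)
```

With identical logprobs the estimate is exactly zero, so faithful routing replay should keep it near zero, while diverging expert routing would push it up.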

Notes

Companion to PrimeIntellect-ai/verifiers#1574 and PrimeIntellect-ai/renderers#79 (the dynamo_chat TITO transport this orchestrator path drives). The deps commit repoints the verifiers/renderers submodules at biswapanda forks pending those PRs merging.


Note

High Risk
Touches NCCL weight broadcast, inference weight reload (E8M0/FP8), orchestrator–inference admin contracts, and large vLLM runtime patches; misconfiguration can break training sync or serving on GPU clusters.

Overview
Adds NVIDIA Dynamo as an alternate inference backend (client.backend: vllm | dynamo) via an AdminAPI abstraction (VLLMAdminAPI vs DynamoAdminAPI on /engine/*), RL worker discovery (GET /v1/rl/workers, rl_base_url), and renderer_transport=dynamo_chat for nvext rollouts. Orchestrator stops defaulting return_token_ids for Dynamo; teacher logprobs dispatch on transport (vLLM generate vs Dynamo chat/nvext).
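
The backend selector can be pictured as a small factory over the two admin clients. This is a sketch under assumptions: the class names `AdminAPI`, `VLLMAdminAPI` and `DynamoAdminAPI` and the `/v1/rl/workers` route come from the summary above, while the constructor, the `backend()` and `workers_url()` methods, and `make_admin_api` are hypothetical and do not mirror the real signatures.

```python
from abc import ABC, abstractmethod


class AdminAPI(ABC):
    """Common interface the orchestrator talks to, whatever the backend."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    @abstractmethod
    def backend(self) -> str:
        """Name of the inference backend this client targets."""


class VLLMAdminAPI(AdminAPI):
    def backend(self) -> str:
        return "vllm"


class DynamoAdminAPI(AdminAPI):
    def backend(self) -> str:
        return "dynamo"

    def workers_url(self) -> str:
        # RL worker discovery endpoint; base_url plays the role of rl_base_url.
        return f"{self.base_url}/v1/rl/workers"


_ADMIN_APIS: dict[str, type[AdminAPI]] = {
    "vllm": VLLMAdminAPI,
    "dynamo": DynamoAdminAPI,
}


def make_admin_api(backend: str, base_url: str) -> AdminAPI:
    """Pick the admin client for client.backend (vllm | dynamo)."""
    try:
        cls = _ADMIN_APIS[backend]
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}, expected vllm or dynamo") from None
    return cls(base_url)
```

Keeping the selection in one place means the rest of the orchestrator only depends on the `AdminAPI` interface, which is what lets the existing vLLM path stay unchanged.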

Weight sync & GB200: Filesystem broadcast gains configurable keep_recent, fsync-before-STABLE, and retention-aware cleanup; NCCL broadcast adds per-layer dist.barrier + CUDA sync. Inference reload handles DeepGEMM E8M0 scale layout; Qwen3 MoE can export vLLM kernel/FP8 weights. vLLM patches add int64 DeepGEMM SiLU/mul quant, fp32 lm-head idempotency, and dtype-aware routed_experts capture/replay (moe_backend, auto triton when router replay is on).
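
The fsync-before-STABLE and retention-aware cleanup ideas can be sketched as follows. This is illustrative only: the `step_<n>` directory layout, the `weights.bin` file name, and both function names are hypothetical, and the sketch assumes a POSIX filesystem where a directory can be opened and fsynced. Only the STABLE marker and keep_recent come from the summary above.

```python
import os
import shutil
from pathlib import Path


def _fsync_path(path: Path) -> None:
    """Flush a file or directory entry to stable storage (POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def publish_step(root: Path, step: int, payload: bytes) -> Path:
    """Write weights, fsync them, and only then create the STABLE marker."""
    step_dir = root / f"step_{step}"
    step_dir.mkdir(parents=True, exist_ok=True)

    weights = step_dir / "weights.bin"
    with open(weights, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    _fsync_path(step_dir)  # make the new file's directory entry durable

    # Readers treat STABLE as "everything in this directory is complete".
    (step_dir / "STABLE").touch()
    _fsync_path(step_dir)
    return step_dir


def cleanup(root: Path, keep_recent: int) -> list[int]:
    """Delete all but the keep_recent newest STABLE steps; return removed steps."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be >= 1")
    steps = sorted(
        int(p.name.split("_", 1)[1])
        for p in root.glob("step_*")
        if (p / "STABLE").exists()  # in-progress steps are left alone
    )
    stale = steps[:-keep_recent]
    for step in stale:
        shutil.rmtree(root / f"step_{step}")
    return stale
```

Ordering the marker after the fsync matters on network filesystems, where a reader on another node could otherwise see STABLE before the weight bytes are visible.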

Deploy: New Dockerfile.cuda.runtime (cuda-dl-base devel for NVRTC/tilelang + python3.12-dev), Dynamo k8s examples (DGD, ConfigMap, Helm values with inference disabled), Helm chart extensions (ConfigMap mounts, existingClaim, DRA resource claims, tolerations/pull secrets), and tools/dynamo launch/smoke scripts.

Reviewed by Cursor Bugbot for commit 08bb4ea.

biswapanda changed the title from "feat: Dynamo (GB200) inference backend + weight transfer + deploy tooling" to "feat: Dynamo inference backend integration" on Jun 9, 2026
biswapanda changed the title from "feat: Dynamo inference backend integration" to "feat: dynamo inference backend integration" on Jun 9, 2026
Comment thread src/prime_rl/inference/vllm/worker/nccl.py
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
Comment thread k8s/prime-rl/templates/deployment.yaml
Comment thread src/prime_rl/utils/client.py
…t; bump verifiers/renderers deps to rl-sdk-4 heads
Comment thread src/prime_rl/utils/client.py
Comment thread src/prime_rl/trainer/rl/train.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 2c61937.

Comment thread k8s/dynamo-deploy/prime-rl-configs.yaml